Overview of the Full-Text Document Retrieval Benchmark
Abstract
8.1 Introduction

For most of recorded history, textual data have existed primarily in hardcopy form, and document retrieval was essentially a manual task, possibly aided by cross-reference catalogs. By the mid-1960s, work was under way at the University of Pittsburgh to develop computer-assisted legal research systems [Harrington, 1984–85]. During the same period, computer-based document retrieval systems were beginning to emerge in commercial firms, for example InfoBank at the New York Times [Harrington, 1984–85]. The most distinguishing characteristics of such systems include full-text Boolean search logic and support for proximity expressions (e.g., phrases). With this technology, termed full-text retrieval (FTR), documents are selected from a database by content rather than by predefined keywords or subject categories.

For example, suppose that we are interested in locating articles about benchmarking full-text document retrieval systems. To formulate a search expression that specifies the desired content, we could select keywords (e.g., benchmark, performance) and phrases (e.g., document retrieval, full-text retrieval, information retrieval) that are likely to appear in relevant documents. This simple example illustrates one important shortcoming of FTR systems: the inherent ambiguity of natural language makes FTR query formulation imprecise. Although FTR systems lack closed methods to formulate